{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "026d740f-0a39-45ae-b099-d344e1160095",
   "metadata": {},
   "source": [
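    "## Data Analysis\n",
    "### Descriptive Statistics\n",
    "Measures of central tendency summarize the typical value of the data: mean (sum of the samples / number of samples), median (the middle value once the data are sorted), and mode (the most frequent value).\n",
    "\n",
    "Common functions: describe(), size, sum(), mean(), var(), std()"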
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4648743-f78f-470b-84a1-74fea202eb20",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
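    "# Display settings: at most 20 columns and 10 rows, floats with two decimals\n",
    "pd.set_option('display.max_columns', 20)\n",
    "pd.set_option('display.max_rows', 10)\n",
    "pd.set_option('display.float_format', '{:.2f}'.format)\n",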
    "df = pd.read_csv('df.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99ef8a91-b894-4d5d-9038-db10a90f20e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.成交量.describe()  # count, mean, std, min, 25%/50%/75% percentiles, max"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d122722d-bca1-49b9-993d-7c408229d615",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.成交量.size  # number of elements, NaN included (describe() counts non-NaN only)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb2c795d-2d56-43b3-9312-d064b6a56574",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.成交量.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b31d7d74-9bca-4eb8-97e8-1f9df50b6df0",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.成交量.std()  # sample standard deviation (ddof=1 by default)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70d2b9c2-c028-483c-b5b0-e79913acc3f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.成交量.var()  # sample variance (ddof=1 by default)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67f3b9fc-094a-40d4-85bb-773d4022c11c",
   "metadata": {},
   "source": [
    "### Grouping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17668aef-691b-485a-a762-cce89eadc893",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mean 成交量 (volume) for each stock, grouped by name and code\n",
    "df.groupby(['股票名称', '股票代码'])['成交量'].mean()\n",
    "# .agg([('开盘', 'max'), ('成交量', 'std'), ('涨跌幅', 'mean')])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e34266bb-5775-4a01-a92a-479fa1d31a2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Several statistics of 成交量 per stock; output columns: 最大值 = max, 标准差 = std, 均值 = mean\n",
    "df.groupby(by=['股票名称', '股票代码'])['成交量'].agg([('最大值', 'max'), ('标准差', 'std'), ('均值', 'mean')])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3df58d2-9ac2-40f0-abf6-01c7e3987a8c",
   "metadata": {},
   "source": [
    "### Distribution\n",
    "A histogram describes how often each value (or range of values) occurs, as a frequency or a probability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4b8537e-7e55-4745-b733-e3d30fb2b83a",
   "metadata": {},
   "outputs": [],
   "source": [
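    "# Bin 成交量 (volume) into four labelled ranges: 少量 (low), 中量 (medium), 大量 (high), 巨量 (very high)\n",
    "df['交易量'] = pd.cut(df.成交量, bins=[df.成交量.min() - 1, 900000, 1300000, 1500000, df.成交量.max() + 1], labels=['少量', '中量', '大量', '巨量'])\n",
    "# Count the rows that fall into each bin (天数 = number of days)\n",
    "df.groupby('交易量', observed=False)['成交量'].agg([('天数', 'size')])"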
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78cb29c6-60f1-4739-a0bf-57075686598e",
   "metadata": {},
   "source": [
    "### Cross-Tabulation by Group\n",
    "pivot_table presents summary statistics for related variables as a table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "197e3449-baab-40dd-a193-36f28594d1ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "# values: column(s) to aggregate; index: keys for the rows; columns: keys for the column groups; aggfunc: aggregation function(s) applied to values\n",
    "df.pivot_table(values=['成交量'], index=['交易量'], columns=['股票代码','日期'], aggfunc=['size'], observed=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85838c1a-f8dc-4382-8240-4de994957c07",
   "metadata": {},
   "source": [
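    "### Correlation\n",
    "A correlation coefficient quantifies how strongly two data sets are related.  \n",
    "As a rule of thumb, an absolute value in [0, 0.3) indicates low correlation, [0.3, 0.8) moderate correlation, and [0.8, 1] high correlation.\n",
    "\n",
    "- [Computing correlation coefficients with corr](https://blog.csdn.net/qq_41721951/article/details/109645921)\n",
    "- [Common feature selection methods: the Pearson correlation coefficient](https://guyuecanhui.github.io/2019/07/20/feature-selection-pearson/)\n",
    "- [Common feature selection methods: the Spearman correlation coefficient](https://guyuecanhui.github.io/2019/07/28/feature-selection-spearman/)\n",
    "- [Common feature selection methods: the Kendall rank correlation coefficient](https://guyuecanhui.github.io/2019/08/10/feature-selection-kendall/)"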
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c6a1f9d4-47af-46b9-869a-3c25bd030f94",
   "metadata": {},
   "outputs": [],
   "source": [
    "df['振幅'].corr(df['涨跌幅'])  # Pearson correlation by default (method='pearson')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
